{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Processing for NeMo 2.0 LLMs with the SlimPajama Dataset\n",
    "\n",
    "This tutorial will guide you through the process of transforming a raw pretraining dataset into a configured data module for pretraining with a NeMo 2.0 recipe. We will use the [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B>) dataset as our reference. Additionally, we will demonstrate how to exclude specific sources from the dataset, such as excluding all data from the `RedPajamaBook` set by default.\n",
    "\n",
    "This tutorial involves four steps:\n",
    "\n",
    "1. Download data\n",
    "2. Extract data\n",
    "3. Concatenate data\n",
    "4. Preprocess data for NeMo 2.0/Megatron\n",
    "\n",
    "First, we'll define each step. Next, we will see how we can use NeMo-Run to execute the steps sequentially on your local workstation using Docker or on Slurm.\n",
    "\n",
    "### Prerequisites\n",
    "This notebook assumes familiarity with [NeMo-Run](https://github.com/NVIDIA/NeMo-Run). Additionally, the Docker execution and Slurm execution steps require access to Docker on your host and a remote Slurm cluster, respectively.\n",
    "Additionally, you will have to complete the following steps:\n",
    "\n",
    "1. Set HOST_DATA_PATH in the first cell to a parent folder on your workstation where you want to save the data.\n",
    "1. Create directories `HOST_DATA_PATH/tokenizer` and `HOST_DATA_PATH/slimpajama`.\n",
    "1. Download the Llama `tokenizer.model` file either from [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b/blob/main/tokenizer.model) or https://www.llama.com/llama-downloads/ and place it at `{HOST_DATA_PATH}/tokenizer/tokenizer.model`.\n",
    "    For HF, you can do it by running \n",
    "    ```bash\n",
    "    HF_TOKEN=... huggingface-cli download meta-llama/Llama-2-7B tokenizer.model --local-dir {HOST_DATA_PATH}/tokenizer/\n",
    "    ```\n",
    "\n",
    "> [!NOTE]\n",
    "> All code for this tutorial can be found at https://github.com/NVIDIA/NeMo/tree/main/examples/llm/slimpajama."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nemo_run as run\n",
    "\n",
    "from data.download import download_slimpajama\n",
    "from data.extract import run_extraction\n",
    "from data.preprocess import preprocess_data\n",
    "\n",
    "HOST_DATA_PATH = \"/data\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download Data\n",
    "\n",
    "First, we will configure the task to download data from Hugging Face. We will use the Hugging Face CLI for this. The function that configures the download script can be found [here](./data/download.py)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "download_task = download_slimpajama(\n",
    "    include_pattern='--include \"train/chunk1/*_100*zst\"',\n",
    ")\n",
    "\n",
    "# The configured script looks like below\n",
    "print(download_task.inline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extract Data\n",
    "\n",
    "The downloaded data is in compressed ZST format. We need to extract it into JSONL files. For that, we will configure the `extract_data` function defined [here](./data/extract.py). This function also allows excluding certain sources. By default, we exclude all data from the `RedPajamaBook` set, but this setting is configurable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "run_extraction??"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "extract_task = run.Partial(run_extraction, data_dir=\"/data/slimpajama\")\n",
    "extract_task"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Concatenate Data\n",
    "\n",
    "This optional step concatenates small JSONL files into a single large JSONL file. The example script is [here](./data/concat.sh), but feel free to change it based on your needs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "concat_task = run.Script(\"/nemo_run/code/data/concat.sh\", args=[\"/data/slimpajama/train\", \"1\"])\n",
    "concat_task"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preprocess Data\n",
    "\n",
    "This final step preprocesses the JSONL files to the BIN and IDX files required by NeMo and Megatron Core. It uses the `preprocess_data` function defined [here](./data/preprocess.py)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preprocess_data??"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preprocess_task = run.Partial(\n",
    "    preprocess_data,\n",
    "    data_dir=\"/data/slimpajama\",\n",
    "    output_dir=\"/data/slimpajama_megatron\",\n",
    "    tokenizer_model=\"/data/tokenizer/tokenizer.model\",\n",
    "    tokenizer_library=\"sentencepiece\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preprocess_task"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Put it all together\n",
    "\n",
    "Now that all the tasks are configured, lets define an executor to run them on and an experiment to run them sequeuntially. \n",
    "\n",
    "> [!NOTE]\n",
    "> Each task can be run individually or in any combination. The notebook runs all tasks sequentially. To remove a task, just remove the corresponding `exp.add(...)` for that corresponding task.\n",
    "> This customization is handy if you already have JSONL files processed, for example, from NeMo-Curator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's define a local executor to run the experiment locally.\n",
    "def docker_executor(host_data_path: str):\n",
    "    packager = run.GitArchivePackager(subpath=\"examples/llm/slimpajama\") # This will package all code inside the folder. NOTE: only committed changes are packaged, so if you make a change, make sure to commit it.\n",
    "    executor = run.DockerExecutor(\n",
    "        packager=packager,\n",
    "        ipc_mode=\"host\",\n",
    "        shm_size=\"30g\",\n",
    "        env_vars={\"PYTHONUNBUFFERED\": \"1\"},\n",
    "        volumes=[f\"{host_data_path}:/data\"],\n",
    "        container_image=\"python:3.11\",\n",
    "        ulimits=[\"memlock:-1\", \"stack:67108864\"],\n",
    "    )\n",
    "    return executor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace the host_data_path with the path on your host to save the data to.\n",
    "executor = docker_executor(host_data_path=\"/data\")\n",
    "\n",
    "with run.Experiment(\"slimpajama-data-pipeline\") as exp:\n",
    "    exp.add(download_task, name=\"download_slimpajama\", executor=executor)\n",
    "\n",
    "    # Use NeMo image for the remaining tasks\n",
    "    executor.container_image = \"nvcr.io/nvidia/nemo:dev\"\n",
    "    exp.add(extract_task, name=\"extract_slimpajama\", executor=executor)\n",
    "\n",
    "    # examples/llm/slimpajama is automatically mounted to /nemo_run/code\n",
    "    exp.add(concat_task, name=\"concat_slimpajama\", executor=executor)\n",
    "    exp.add(preprocess_task, name=\"preprocess_slimpajama\", executor=executor)\n",
    "\n",
    "    exp.run(sequential=True, tail_logs=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the experiment runs successfully, you will see the BIN and IDX files as shown below. These files can directly be used in NeMo and Megatron Data Loaders."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "concatenated_chunk1.jsonl_text_document.bin\n",
      "concatenated_chunk1.jsonl_text_document.idx\n"
     ]
    }
   ],
   "source": [
    "!ls {HOST_DATA_PATH}/slimpajama_megatron"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Appendix\n",
    "\n",
    "### Running on Slurm\n",
    "\n",
    "You can also run the same experiment on a remote cluster like Slurm by replacing the Docker executor with a Slurm executor. A sample definition of a Slurm executor looks like:\n",
    "\n",
    "```python\n",
    "def slurm_executor(\n",
    "    user: str,\n",
    "    host: str,\n",
    "    remote_job_dir: str,\n",
    "    account: str,\n",
    "    partition: str,\n",
    "    nodes: int,\n",
    "    tasks_per_node: int,\n",
    "    time: str = \"04:00:00\",\n",
    "    custom_mounts: Optional[list[str]] = None,\n",
    "    custom_env_vars: Optional[dict[str, str]] = None,\n",
    "    container_image: str = \"nvcr.io/nvidia/nemo:dev\",\n",
    "    retries: int = 0,\n",
    ") -> run.SlurmExecutor:\n",
    "    if not (user and host and remote_job_dir and account and partition and nodes and tasks_per_node):\n",
    "        raise RuntimeError(\n",
    "            \"Please set user, host, remote_job_dir, account, partition, nodes and devices args for using this function.\"\n",
    "        )\n",
    "\n",
    "    mounts = []\n",
    "    if custom_mounts:\n",
    "        mounts.extend(custom_mounts)\n",
    "\n",
    "    env_vars = {\n",
    "        \"NVIDIA_VISIBLE_DEVICES\": \"void\", # Might be needed for CPU only nodes with NeMo docker image\n",
    "    }\n",
    "    if custom_env_vars:\n",
    "        env_vars |= custom_env_vars\n",
    "\n",
    "    executor = run.SlurmExecutor(\n",
    "        account=account,\n",
    "        partition=partition,\n",
    "        tunnel=run.SSHTunnel(\n",
    "            user=user,\n",
    "            host=host,\n",
    "            job_dir=remote_job_dir,\n",
    "            identity=\"/path/to/identity/file/for/ssh/to/cluster\",  # OPTIONAL: Provide path to the private key that can be used to establish the SSH connection without entering your password\n",
    "        ),\n",
    "        nodes=nodes,\n",
    "        ntasks_per_node=tasks_per_node,\n",
    "        mem=\"0\",\n",
    "        exclusive=True,\n",
    "        packager=run.GitArchivePackager(subpath=\"examples/llm/slimpajama\"),\n",
    "    )\n",
    "\n",
    "    executor.container_image = container_image\n",
    "    executor.container_mounts = mounts\n",
    "    executor.env_vars = env_vars\n",
    "    executor.retries = retries\n",
    "    executor.time = time\n",
    "\n",
    "    return executor\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
